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Abstract 

We extend multi-way, multivariate ANOVA-type analysis to cases where one 
covariate is the view, with features of each view coming from different, high- 
dimensional domains. The different views are assumed to be connected by having 
paired samples; this is a common setup in recent bioinformatics experiments, of 
which we analyze metabolite profiles in different conditions (disease vs. control 
and treatment vs. untreated) in different tissues (views). We introduce a multi- 
way latent variable model for this new task, by extending the generative model of 
Bayesian canonical correlation analysis (CCA) both to take multi-way covariate 
information into account as population priors, and by reducing the dimensionality 
by an integrated factor analysis that assumes the metabolites to come in correlated 
groups. 



1 Introduction 

Finding disease and treatment effects from populations of measurements is a prototypical multi-way 
modeling task, traditionally solved with multivariate ANOVA. Here disease state (diseased/healthy) 
and treatment (treated/placebo) are the two covariates, and the research question is, are there dif- 
ferences in the population that can be explained by either covariate or, more interestingly, their 
interaction, which would hint at the treatment being effective. It is naturally additionally interesting 
what the differences are. 

A recurring problem in multi-way analyses, especially with modern high-throughput measurements 
in molecular biology, is the "small n, large p"-problem. The dimensionality p of the measurements is 
high while the number of samples n is low, and additionally the data may be collinear making estima- 
tion of the effects impossible with classical methods, univariate or multivariate linear models solved 
with multi-way ANOVA techniques. The most promising modern method, Bayesian sparse factor 
regression model (TJ, is useful in finding the variables most strongly related to the external covariate 
and to infer relationships between those variables via common latent factors. Instead of a regression 
model we will use a generative latent factor model which incorporates an assumption of clustered- 
ness of the variables to regularize the model, and makes it possible to extend the model to multi-view 
factor analysis. Such clusteredness is well justified in our application field, metabolomics, where 
due to biochemical reaction pathways the variation in concentrations of metabolite groups is highly 
correlated Q. 
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Assume that measurements have been made on the same objects but with different methods, resulting 
in different data sources possibly on different domains. An example we will analyze in this paper is 
metabolomic profiles in different tissues, where the domains are partly different since the metabolites 
cannot be fully matched. The different views form one covariate in the multi-way analysis, with the 
additional problem that the samples come from different domains and cannot be directly compared. 
We introduce a new hierarchy level of latent variables intended to decompose the views into view- 
specific and shared components, which is needed for the multi-way analysis. Such a decomposition 
is possible given that the samples in the different views come in pairs, which we need to assume. 

The resulting decomposition between the views turns out to be implementable with Bayesian canon- 
ical correlation analysis E]|4l|5l, interpretable as unsupervised multi-view modeling. Hence, in this 
work we re-interpret unsupervised multi-view modeling as one-way modeling of samples from dif- 
ferent domains, and combine it with multi-way modeling. Given that we additionally can work 
under the large p, small n conditions, the model is expected to have widespread applicability in 
current molecular biological measurements. 

2 Model 

2.1 Multi-way, multi-view 

We will generalize ANOVA to multi-view (multi-domain) analysis, restricting to two covariates and 
two views for simplicity. Using ANOVA-style notation and assuming the views to be in the same 
domain, the multivariate linear model for samples is 

w d = a a + (3 b + (a(3) ab + J d + (aj) ad + (/3-y) bd + (af3j) abd + noise, (1) 

where a and b {a = 0, ... A and 6 = 0,... B), are the two traditional independent covariates such 
as disease and treatment, and d denotes the view. 

For different values of d the domain of v d may vary, meaning different feature spaces with different 
dimensionalities. We assume the samples of the different views to come in pairs, v = [x, y]. For the 
rest of the paper we will change the notation for clarity to v 1 = x, v 2 = y, and assume a mapping 
f x from the effects to the domain of x which is linear for now. Then, 

x = r ( Qo + b + (a/3) o6 ) + f ((a)* + (/9)£ + (a/3)* 6 ) + noise, (2) 

assuming y d = 0, because it does not make sense to compare means of different domains, and that 
the view-specific effects are in the same domain as the view-independent effects and hence need to 
be transformed with the same function. The equation for y is analogous. 

To our knowledge, there exists no method capable of studying the view-independent, and view- 
dependent effects. In the next section we will introduce a model which will additionally assume that 
the effects may be uncertain, resulting in a hierarchical Bayesian model. 

2.2 Hierarchical model 

We next formulate a hierarchical latent- variable model for the task of multi-way, multi-view learning 
under "large p, small n" conditions. For this we need three components: (i) regularized dimension 
reduction, (ii) combination of different data domains, and (iii) multi-way analysis. We formulate 
each of these as part of a big generative model. We will first summarize the main components of the 
model shown in Figure [T[ and then describe each part in detail. 

To deal with the small sample size n <C p problem, we reduce the dimensionality of the data x and 
y from the two views into their respective latent variables x' at and y lat . This is done with factor an- 
alyzers which assume that the variables come in groups, which is a strongly regularizing assumption 
effective under the "large p, small n" conditions. The clustering assumption is particularly sensible 
under the assumption that metabolomics data, our main application, contains strongly correlated 
groups of variables Q. 

The second necessary element is search for a view shared by the two different domains x and y, 
needed for finding shared multi-way effects. Given paired data, this is a task for Bayesian CCA 
(BCCA) EJIHI5] which introduces a new hiearchy level where a latent variable z captures the shared 
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Figure 1 : The hierarchical latent-variable model for multi-way, multi-view learning under "large p, 
small n" conditions. 

variation between the views. The view-specific variation has been implicitly modeled by view- 
specific latent variables which have been integrated out, resulting in flexible covariance matrices 
parameterized by the 

The third necessary element, the ANOVA-type two-way analysis is supplemented by assigning the 
effect terms as priors on the latent variables z; in normal BCCA the prior is zero-mean. The observed 
covariates a and b choose the correct effects for each sample. The covariates hence effectively 
change the means of the data as in eqn Q, and the variation around the mean is modeled with 
the rest of the model. The central differences from (|2]) are that the model is hierarchical, implying 
that the arguments of the linear function f x have a distribution, and that the "noise" is structured, 
stemming from all the latent variables. With these additions, the model will be better able to take 
into account the uncertainty in the data. 

The posterior is computed with Gibbs sampling. The Gibbs-formulas are included in the supple- 
mentary material. 

In effect the model, shown in Figure [T] consists of two factor analyzers, where the loadings assume 
cluster memberships (multiplied with scales), a generative model of CCA, and population-specific 
priors on z that assume ANOVA-type multi-way structure. We will now introduce the details of each 
of these parts in turn. 

2.2.1 Factor analysis model 

We need to reduce dimensionality, which can be done by factor analysis (FA). The model |6| for n 
exchangeable replicates is 

Xj ~#(/x x + V^xf,A I ) . (3) 

Here V x is the projection matrix that is assumed to generate the data vector Xj from the latent 
variable x^ at . The x^ a * is a latent variable vector, whose elements are known as factor scores. 

The V^xJ"' models such common variance of the data around the variable-means fi x that can be 
explained by factors common to all or many variables, effectively estimated from the sample covari- 
ance matrix of the dataset. The sample covariance becomes decomposed into £ = V I V lT + A x , 
where A x is a diagonal residual variance matrix with diagonal elements of, modelling the variable- 
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specific noise not explained by the latent factors. The covariance matrix of x. lat , 
the CCA. 



^ x , comes from 



At this point, when n < p, V x cannot be estimated due to the singularity of the sample covariance 
matrix. To overcome the n <C p problem, we now restrict V 1 to a non-singular clustering matrix, 
suitable for data containing highly correlated groups of variables. 

2.2.2 Projection matrix that assumes grouped variables 

We make the structured assumption that there are strongly correlated groups of variables in the data, 
the generated values within the whole group being governed by one latent variable. The projection 
matrix V x is positive-valued, each row having one non-zero element corresponding to the cluster 
assignment of the variable, 

Ai 
A 2 



V 1 



A, 



+i 



(4) 



The location of the non-zero value on row i, v i7 follows a multinomial distribution with one ob- 
servation, with an uninformative prior distribution TTi. The m could also be used to encode prior 
information on the known grouping of variables. The variation of each variable within a cluster 
is assumed to be modeled by the same latent variable, but the scales A, may differ. The variable- 
specific residual variances of, that are the diagonal elements of A, follow a scaled Inv-^ 2 with an 
uninformative prior. 

In summary, we regularize the covariance matrix by assuming that the main correlations are positive 
correlations between variables belonging to the same cluster. This correlation is mediated through 
a common latent variable; this is a reasonable assumption for metabolomics data and, furthermore, 
facilitates interpretation of the results. 

2.2.3 Generative model of CCA 

We need to combine different data domains, and for paired data that can be done with CCA. The 
generative model of BCCA has been formulated [3 , 7] for sample j as 



*i~JV(0,I), 



(5) 



and likewise for y. Here we have assumed no mean parameter since the mean of the data is estimated 
in the factor analysis part. The W 1 is a projection matrix from the latent variables Zj, and *& x is a 
matrix of marginal variances. The crucial thing is that the latent variables z are shared between the 
two data sets, while everything else is independent. The prior distributions were chosen as 



W; ~JV(0,AI), 

0i~IG(a o ,0o), 

y x ,yy ~ iw(s ,vo). 



(6) 



Here w; denotes the Zth column of W, and IG and IW are shorthand notations for the inverse 
Gamma and inverse Wishart distributions. The priors for the covariance matrices *$> x and \E rJ/ are 
conventional conjugate priors, and the prior for the projection matrices is the so-called Automatic 
Relevance Determination (ARD) prior used for example in Bayesian principal component analysis 
0. 



2.2.4 ANOVA-type model for latent variables. 

We assume that the ANOVA-type effects act on the latent variables z, which allows access to effects 
found in both the spaces x lat and y lat . They are modeled as population priors to the latent variables, 
which in turn are given Gaussian priors a a , (3 h , (a(3) a b ~ jV(0, I). 
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Figure 2: The graphical model describing the decomposition of covariate effects into shared and 
view-specific ones. The figure expands the top part of Figure [T] leaving out the feature extraction 
part and some parameters. 

In the K z -dimensional latent variable space we then have 

Zj = OL a + (3 b + (a(3)ab + £j, (7) 

where ej is a noise term. Note that the grand means are estimated in the lower level of hierarchy, 
that is, directly in the x and y-spaces, and do not appear here. 

To simplify the interpretation of the effects we center the grand means to the mean of one control 
population. A similar choice has been done successfully in other ANOVA studies [9], and it does 
not significantly sacrifice generality. We set the parameter vector fi x , describing variable-specific 
means, to the mean of the control group. One group now becomes the baseline to which other classes 
are compared by adding main and interaction effects. For convenience, we will additionally change 
the variables compared to the standard ANOVA convention, such that the terms c*o, ot/3 m , 
(o;/3)o6, and (af3) a Q are not estimated. The differences between the populations are now modelled 
directly with x. lat and y lat , and hierarchically by the main effects a a , f3 b , (af3) ab , a, b > 0. 

In our case study, a and b have only two values and we have populations (a, 6) = 
(0, 0), (1, 0), (0, 1), (1, 1), and there are hence three terms a\, (3 1 and (ct/3)n, that model the dif- 
ference to the control population (a, b) = (0, 0). 

In summary, the complete hierarchical model of Figure[T]is 

a = 0,(3 Q = 0, (aP) a o = 0, (a(3) Qb = 
a a ,/3 b ,(aj3) ab ~JV(0,I) 
Zj\jea,b ~ N(a a +(3 b + (a/3) a6 ,I) 

x rf "A/V+Vx^.A*). (8) 
2.3 Decomposing covariate effects into shared and view-specific 

So far we have not discussed how the model finds the view-related effects or, in our application, 
tissue effects a£, (3 b , and (af3) x b , and likewise for y. 

The Bayesian CCA assumes that the data is generated by a sum of view-specific latent variables 
z x and z v , and shared latent variables z, and the former have been integrated out in the graphical 
model of Figure[T] The way to implement the view-specific effects is to assign them as priors to the 
view-specific latent variables. Then we do not want to integrate them out but include them explicitly 
in the model as shown in Figure [2] 

As a technical note, to make computation of the model faster and more reliable, we have further 
included view-specific latent variables that do not have disease or treatment effects. They have been 
integrated out, resulting in the covariance parameters in Figure[2] Their role is to explain away all 
or most of the variation that is unrelated to the disease and treatment effects, so that Gibbs sampling 
does not need to model all that variation. This trick should not change modeling results in the limit 
of an infinite time for computation. 
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In practice the decomposition in Figure [2]is implemented by restricting a column of W K to be zero 
for the y-specific components and vice versa for x. 

2.4 Data preprocessing and model complexity selection 

For simplicity and to reduce the number of parameters of the model, the data is preprocessed such 
that for each variable the mean of the control population a = 0, b = is subtracted and the variable 
is scaled by the standard deviation of the control population. This fixes the scales \i to one and the 
fj, x and n v to zero. The factor analysis part now models correlations of the variables. The possible 
covariate effects are now comparable to the control population as discussed in Chapter [2.2.4| 

Model complexity, that is, the number of clusters and latent variables, is chosen separately for both 
x iat an( j yiat p rec jictive likelihood in 10-fold cross-validation. 

3 Results 

We demonstrate the working of the method on generated data, and apply it to a disease study where 
lipidomic profiles have been measured from several tissues of model mouse samples, under a two- 
way experimental setup (disease and treatment), the two feature spaces (lipid profiles) are distinct 
and samples paired. 

3.1 Generated data 

We generate data having known effects, and then study how well the model finds the effects as a 
function of the number of measurements. There are three effects, in a, (3 V , and (a(3) x . 

Each of the three effects have strength +2, the x' at and y lat are both 3-dimensional, and the x and 
y are 200-dimensional. The = 1 for each variable i in x and y. The model is computed by Gibbs 
sampling, discarding 1000 burn-in samples, and collecting 1000 samples for inference. To fix the 
sign of the effects without affecting the results, each posterior distribution is mirrored, if necessary, 
to have a positive mean, i.e. multiplied by the sign of the posterior mean. 

The method finds the three generated effects, shown in Figure [3] The uncertainty decreases with 
increasing number of observations. The shared effect is found with much less uncertainty since 
there is evidence from both views. With low numbers of samples, there is considerable uncertainty 
in the effects for view-specific components. In typical bioinformatics applications there may be 
20-50 samples. 

3.2 Lung cancer study 

We then study data from a two-way, two-view, n <C p, so far unpublished lung cancer mouse 
model experiment. The diseased mice are compared to healthy control samples and, in addition, 
some mice from both groups have been given a test anticancer drug treatment. There are thus 
healthy untreated (9 mice/samples), diseased untreated (7), healthy treated (6) and diseased treated 
(6) samples. Lipidomic profiles have been measured by Liquid Chromatography Mass Spectrometry. 
The study has a two-way experimental setup, such that disease effect a, treatment effect (3 and an 
interaction effect a(3 on lipid groups are to be estimated. The high-dimensional lipidomic profiles 
have been measured from several tissues of each mouse; the tissues have partly different lipids that 
have not been matched, and even the roles of the matched lipids may be different in different tissues. 
Hence, the tissues have different feature spaces with paired samples, implying a two-view study. We 
will specifically study the relationship between blood and lung tissue, which is the most interesting 
for diagnosis, since blood can be easily sampled. 

3.2.1 Experiment 1. Effects shared by blood and disease tissue 

Blood plasma (168 lipids) and lung tissue (68 lipids) were integrated with the method. The optimal 
number of clusters for plasma was 6 and for lung 5, found by predictive likelihood. The method finds 
a disease effect a and treatment effect (3 shared by both views (Fig. [4]). The effect can be traced back 



6 



T3 

CD 
i_ 

rO 

in 



u 
<£ 
u 

CD 

a. 

to 




8° 



CO_ c 




no o 



o o o o o 

<M LO O O O 

7- cm in 




n samples 



o o o o o 

CM LO O O O 
1 — C\J LO 




o o o o 

CM LO O O 

7- CM LO 




O O O O O 
CM LO O O O 
1 — CM LO 




X 










* CM - 












• - / 








CO- - 










8 C V- 












i i i 


- 1 — 






o o o 


o 


o 




CM LO O 


o 


o 






CM 


LO 




n samples 



o o o o 

CM LO O O 

t— CM LO 

n samples 



Figure 3: The method finds the generated effects a — +2,(3 y = +2, and (af3) x = +2. (effect 
subscripts i and n have been dropped in the rest of the results section). The dots show posterior 
mean and the thin lines include 95% of posterior mass, as a function of number of observations. A 
consistently non-zero posterior distribution implies an effect found. 



to the metabolite groups, by first identifying the responsible row of W 1 and hence component of 
x' at , and then the metabolite cluster from the V x corresponding to the x' at component. 

The results imply that a cluster of 12 lipids in lung and a cluster of 20 lipids in blood are mutually 
coherently up-regulated due to disease, and additionally up-regulated by the treatment. Another 
cluster of 13 lipids in lung was found down-regulated due to the disease and additionally down- 
regulated due to treatment. The lipids of the down-regulated cluster are thus negatively correlated 
with the up-regulated clusters. The results show that since no consistent interaction term (a/3) is 
found, there is no indication that the treatment would cure the cancer effects. This confirms our 
prior fear that the specific treatment might not be efficient. The treatment does, however, affect the 
same groups of lipids as the disease, so investigating it as a potential cure was not a far-fetched 
hypothesis. 

The up-regulated cluster of blood plasma contains abundant triglycerides known to be coregulated, 
the up-regulated cluster of lung contains lipotoxic ceramides ifTUl and proinflammatory lysophos- 
phatidylcholines [ 1 1 1, while the down-regulated cluster of lung contains ether lipids, known as en- 
dogenous antioxidants lfT2l . Our analysis reveals that the drug treatment enhances, not diminishes, 
the proinflammatory lipid profile found in the disease. 

3.2.2 Experiment 2. When connected to non-diseased tissue, only view-specific effects are 
found 

We then integrate plasma x with another tissue, heart (58 lipids) y. The results ( Figure[4|i, show that 
the disease effect and the treatment effect are found only in the view-specific component of plasma. 
This implies that there are no shared effects between plasma and heart, and in fact no consistent 
effects are found for the heart tissue. The method finds, however, the same effects, a and (3 in the 
plasma tissue as in Experiment 1 and for the same cluster of lipids, which is a sign that the method 
works well. 



4 Discussion 



We have generalized ANOVA-type multi-way analysis to cases where multiple views of samples 
having a multi-way experimental setup are available. The problem is solved by a hierarchical latent 
variable model that extends the generative model of bayesian CCA to model multi-way covariate in- 



7 




Experiment 1 Experiment 2 



Figure 4: In experiment 1 (left), the method finds a disease effect a and a treatment effect j3 shared 
between the two views, plasma (x) and lung (y) tissues. In experiment 2 (right), only view-specific 
effects are found for plasma (x) when integrating with the heart tissue (y). No effects are found in 
heart. The boxplots show quartiles and 95% intervals of posterior mass of the effects, a consistently 
non-zero posterior distribution implies an effect found. 



formation of samples by having population-specific priors on the shared latent variable of CCA. Fur- 
thermore, the method is able to decompose the covariate effects to shared and view-specific effects, 
treating the multiple views as one covariate. Finally, the method is designed for cases with high di- 
mensionality and small sample-size, common in bioinformatics applications. The small sample-size 
problem was solved by assuming that the variables come in correlated groups, which is reasonable 
for the metabolomics application. 

The modelling task is extremely difficult due to the complexity of the task and small sample- 
size. Hence it was striking that the method was capable of finding covariate effects in a real-world 
lipidomic multi-view, multi-way dataset. 

In this work it was possible to estimate only three components, because the number of samples 
was extremely low: a shared component and two view-specific components. If more than one 
shared components are to be estimated, an unidentifiability problem occurs, since there is a rotational 
ambiguity within the solution subspace. The problem can be solved by a deflation-type method, 
where the components are computed one by one. Each posterior sample is now considered as a 
converged starting point, and a second component is added and the model is sampled with having 
the first component fixed. The last sample of each new sampled chain is collected for inference. 
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